Abstract


The  precision  of  item  parameter  estimates  can  be  increased 
by  taking  advantage  of  dependencies  between  the  latent  proficiency 
variable  and  auxiliary  examinee  variables  such  as  age,  courses 
taken,  and  years  of  schooling.  Gains  roughly  equivalent  to  two  to 
six  additional  item  responses  can  be  expected  in  typical 
educational  and  psychological  applications.  Empirical  Bayes 
computational  procedures  are  presented,  and  illustrated  with  data 
from  the  Profile  of  American  Youth  survey. 


Key words: EM algorithm, empirical Bayes, marginal maximum

Exploiting Auxiliary Information about Examinees in the
Estimation of Item Parameters

A  pervasive  problem  in  item  response  theory  (IRT)  is  the 
difficulty  of  simultaneously  estimating  large  numbers  of  parameters 
from  limited  data.  Even  large  samples  of  examinees  may  not 
eliminate  the  problem  when  each  examinee  responds  to  only  a  few 
items,  as  in  educational  assessment  and  adaptive  testing.  Certain 
improvements are obtained by using hierarchical models along the
lines  of  Lindley  and  Smith  (1972);  treating  examinee  parameters  as  a 
sample  from  a  common  population  enhances  the  stability  and  precision 
of  item  parameter  as  well  as  examinee  parameter  estimates.  This 
approach  has  been  applied  to  IRT  by  a  number  of  researchers 
recently,  including  Bock  and  Aitkin  (1981),  Leonard  and  Novick 
(1985), Rigdon and Tsutakawa (1982), and Swaminathan and Gifford
(1982).

For the most part, the aforementioned writers consider all
examinees  to  be  members  of  a  single,  undifferentiated,  population. 
This  framework  instantiates  such  beliefs  as,  "if  the  parameters 
of  most  examinees  seem  to  lie  between  -3  and  +3,  then  the 
parameter  of  an  examinee  who  answered  both  of  two  hard  math  items 
correctly is probably somewhere between +1.5 and +3.5, even though
his/her maximum likelihood estimate is $+\infty$." Additional stability
and precision may yet be achieved if auxiliary information is
available about examinees, such as educational background or status
on  demographic  variables.  A  statement  like  "the  parameter  of  an 
examinee  who  answered  both  of  two  hard  math  items  correctly  and 
studied  calculus  in  college  is  probably  between  +2.7  and  +3.7," 
might  result. 

This  paper  addresses  the  utilization  of  auxiliary  information 
about  examinees  in  estimating  item  parameters.  The  following 
section  reviews  item  parameter  estimation  when  examinee  parameters 
are  known,  then  when  examinee  parameters  are  unknown  and  nothing 
is  assumed  about  them.  Attention  then  turns  to  the  additional 
assumptions  of  first,  an  undifferentiated  population,  and  second, 
a  population  differentiated  with  respect  to  auxiliary  variables. 
Following  this  are  sections  that  discuss  anticipated  gains  in 
precision,  outline  computational  procedures,  and  illustrate  the 
approach  with  responses  to  four  items  from  the  Arithmetic  Knowledge 
subtest  of  the  Armed  Services  Vocational  Aptitude  Battery. 

The  Role  of  Auxiliary  Information 
The  relevance  of  auxiliary  examinee  variables  to  item 
parameter  estimation  is  not  immediately  obvious,  since  they  play 
no role in the basic model for item responses. Letting
$x_i = (x_{i1}, \ldots, x_{in})$ represent the responses of examinee $i$
to $n$ test items and $y_i$ represent values of auxiliary variables such
as educational and demographic status, the standard IRT assumption of
local independence states that

$$p(x_i \mid \theta_i, y_i, \beta) = \prod_{j=1}^{n} p(x_{ij} \mid \theta_i, \beta_j), \tag{1}$$

where $\theta_i$ is the examinee parameter, $\beta = (\beta_1, \ldots, \beta_n)$
are possibly vector-valued item parameters, and the form of
$p(x_{ij} \mid \theta_i, \beta_j)$ is specified a priori through the item
response model. It follows that $y_i$ would indeed be irrelevant to item
parameter estimation if $\theta_i$ were known. The likelihood to be
maximized with respect to $\beta$, given the data matrix
$X = (x_1, \ldots, x_N)$ of responses from $N$ examinees with
proficiencies $\theta = (\theta_1, \ldots, \theta_N)$ and auxiliary
variables $Y = (y_1, \ldots, y_N)$, would be simply

$$L = \prod_{i} p(x_i \mid \theta_i, \beta). \tag{2}$$

The maximum likelihood estimate (MLE) $\hat\beta$ would satisfy the
likelihood equations

$$0 = \sum_i \partial \ell_i(\theta_i)/\partial\beta, \tag{3}$$

where $\ell_i(\theta) = \log p(x_i \mid \theta, \beta)$, and the covariance
matrix of estimation errors for $\hat\beta$ could be approximated by the
inverse of the observed information matrix $I_\beta$:

$$I_\beta = \sum_i \left(\frac{\partial \ell_i(\theta_i)}{\partial\beta}\right)\left(\frac{\partial \ell_i(\theta_i)}{\partial\beta}\right)' \Bigg|_{\beta=\hat\beta}. \tag{4}$$


But Equation 1 gives response probabilities conditioned on $\theta$,
and $\theta$ is not known in practice. The problem that must actually be
solved is to maximize the marginal likelihood

$$L_M = \prod_i \int p(x_i \mid \theta, \beta)\, dF_i(\theta), \tag{5}$$

where $F_i(\theta)$ is the distribution of the unknown proficiency of
examinee $i$. This is an "incomplete data" problem, in the
terminology of Dempster, Laird, and Rubin (1977), corresponding
to the "complete data" problem of maximizing Equation 2 when $\theta$
is known. Assuming the required integrals exist, the likelihood
equations become

$$0 = \sum_i p_i^{-1}(x_i) \int [\partial p(x_i \mid \theta, \beta)/\partial\beta]\, dF_i(\theta),$$

where

$$p_i(x_i) = \int p(x_i \mid \theta, \beta)\, dF_i(\theta).$$

Louis (1982) shows that if Zacks's (1971, Chapter 5) regularity
conditions are met and if $F_i$ is known for all $i$, the diagonal
elements of the incomplete-data observed information matrix,
namely

$$\sum_i p_i^{-1}(x_i) \int \left\{\left(\frac{\partial \ell_i(\theta)}{\partial\beta}\right)\left(\frac{\partial \ell_i(\theta)}{\partial\beta}\right)' p(x_i \mid \theta, \beta)\right\} dF_i(\theta)\, \Bigg|_{\beta=\hat\beta}, \tag{6}$$

cannot exceed the diagonal elements of $I_\beta$. In other words, the
precision with which elements of $\beta$ would be estimated if $\theta$
were known provides an upper limit to the precision to be expected when
$\theta$ is not known but must be inferred.

A similar phenomenon arises in the context of sample survey
analysis when a clustered sampling design is employed to estimate
a mean. If $n$ units are sampled from each of $N$ randomly-selected
clusters, then the squared standard error of the mean, ignoring
finite population corrections, is given as

$$\mathrm{SEM}^2 = \frac{\sigma^2}{nN}\,[1 + (n - 1)\rho],$$

where $\sigma^2$ is the population variance and $\rho$ is the intraclass
correlation coefficient indicating within-cluster homogeneity. If
the number of clusters ($N$) is held constant, increasing the sample
size ($n$) within clusters cannot decrease $\mathrm{SEM}^2$ below
$\rho\sigma^2/N$, the value of $\mathrm{SEM}^2$ obtained when the means
of the sampled clusters are known without error.
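
The floor on the clustered-sample standard error can be checked numerically. The following is a minimal sketch, assuming the standard equal-cluster-size formula reconstructed above; the function names and the example values ($\sigma^2 = 1$, $\rho = 0.2$, $N = 100$) are illustrative and do not come from the paper.

```python
def sem_squared(sigma2, rho, n, N):
    """Squared standard error of the mean for N clusters of n units each,
    ignoring finite population corrections."""
    return sigma2 / (n * N) * (1.0 + (n - 1) * rho)


def sem_squared_floor(sigma2, rho, N):
    """Limit of sem_squared as the within-cluster sample size n grows."""
    return rho * sigma2 / N


if __name__ == "__main__":
    # Illustrative values only (not taken from the paper).
    for n in (1, 5, 50, 5000):
        print(n, sem_squared(1.0, 0.2, n, 100))
    print("floor", sem_squared_floor(1.0, 0.2, 100))
```

Increasing `n` moves the value toward the floor but never below it, which mirrors the role of unknown proficiencies in item parameter estimation.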

The estimation of $\beta$ in the context of IRT must also deal with
uncertainty from two sources. First is the usual limitation of
having data from only a finite sample of examinees. All other
conditions remaining unchanged, increasing $N$ leads to greater
precision for $\hat\beta$. Second is the limitation that $\theta$ remains
unknown even for sampled examinees. For a fixed sample of examinees,
reducing uncertainty about $\theta$ leads to greater precision for
$\hat\beta$. This can be achieved through (i) item responses,
(ii) assumptions about the $F_i$'s, and (iii) auxiliary variables
related to $\theta$.

de Leeuw and Verhelst (1984) point out that finding maxima in
terms of $\beta$ and of each individual $\theta_i$ in the manner suggested
by Birnbaum (1968) is equivalent to maximizing Equation 5 when each
$F_i$ concentrates its mass at the single (unknown) point $\theta_i$. This
joint maximum likelihood (JML) solution utilizes only information
in responses $x_i$ from examinee $i$ to reduce uncertainty about $\theta_i$.

Alternatively, one may consider the $\theta$'s to be identically
distributed, so that $F_i = F$ for all $i$. An auxiliary variable $y$
is thereby implied for all examinees, an indicator signifying that
each is a member of the population whose distribution is specified
by $F$. Appearing in the literature are treatments that assume a
completely specified form for $F$ (e.g., Bock & Lieberman, 1970),
others that assume parametric forms with unknown parameters $\alpha$ to
be estimated along with $\beta$ (e.g., Zwarts & Veldhuesen, 1985), and
still others that provide nonparametric approximations (e.g.,
Tjur, 1982). Under the first of these three approaches, the
assumed population distribution combines with $x_i$ to produce
$p(\theta_i \mid X)$, which in this case equals $p(\theta_i \mid x_i)$. Under
the latter two approaches, responses from examinees other than
examinee $i$ also play a role in estimating $F$, so that
$p(\theta_i \mid x_i) \neq p(\theta_i \mid X)$.
A third alternative, falling between unique, unconstrained
$F_i$'s and identical $F_i$'s, is to posit distributions that depend
on auxiliary variables: that is, $F_i(\theta) = F_{y_i}(\theta)$. Examinees
with identical $y$ values are considered a random sample from a
population indexed by that particular value of $y$, and these
conditional distributions are allowed to vary with $y$. A following
section gives details for two special cases, namely a linear model
and a (quasi-)nonparametric mixture approximation.

Several factors contribute to the magnitude of the precision
gains  that  can  be  achieved  through  population  assumptions  and 
auxiliary  variables.  One  factor  is  the  sensitivity  of  different 
model  parameters  to  missing  information.  Mislevy's  (1984)  analysis 
of  Bock  and  Lieberman's  (1970)  LSAT  data  showed  that  estimates  of 
the  population  variance  were  more  substantially  improved  by 
increases  in  test  length  than  were  estimates  of  the  population 
mean. This might lead one to expect increased information about $\theta$
to  have  more  effect  on  item  slopes  than  on  item  thresholds  in  the 
context  of  item  parameter  estimation. 

A  second  factor  is  the  nature  of  the  joint  distribution  of 
auxiliary variables with $\theta$. An auxiliary variable adept at
identifying  low  proficiency  examinees,  for  example,  adds 
information  for  those  examinees  most  useful  for  estimating  lower 
asymptote  item  parameters. 

A  third  factor  is  the  dependence  of  the  estimated  information 
upon  estimated  parameter  values.  Although  a  slope  parameter 
may be consistently estimated under both the undifferentiated and
differentiated population models, a higher estimate under the
latter may appear less precise. This is because estimated standard
errors for slopes are directly proportional to the values of the
slope estimates, even though true standard errors depend on true
slope values and not their estimates. A slope estimated with the
aid  of  auxiliary  variables  and  obtaining  a  higher  estimate  can 
thus  have  a  lower  true  standard  error  but  a  higher  estimated 
standard  error. 

Since  the  same  factors  determine  information  gain  from  both 
increased  test  length  and  auxiliary  variables,  however,  it  is 
reasonable  to  consider  the  contribution  of  auxiliary  variables  in 
units of additional item responses. In the special case of
dichotomous items, the amount of information conveyed by item
responses alone is

$$i(\theta) = \sum_j \frac{P_j'(\theta)^2}{P_j(\theta)[1 - P_j(\theta)]},$$

where $P_j(\theta) = p(x_j = 1 \mid \theta)$ and $P_j'(\theta) = dP_j(\theta)/d\theta$.
For examinees with finite maximum likelihood estimates, Bayes theorem
applied with a diffuse prior leads to the approximation
$p(\theta \mid x_i) \approx N(\hat\theta, \sigma_x^2)$ with $\sigma_x^2 = i^{-1}$.
This follows by first rescaling the likelihood so that it integrates to
one, then using its mode and curvature at the mode in a normal
approximation.

Consider as an example the two-parameter logistic model,
under which $P_j(\theta) = p(x_j = 1 \mid \theta, a_j, b_j) = 1/\{1 + \exp[-1.7a_j(\theta - b_j)]\}$.
The contribution of item $j$ to information about $\theta$ is
$2.89\,a_j^2 P_j(\theta)[1 - P_j(\theta)]$, and the total information from
$n$ identical items for which $b_j = \theta$ and $a_j = a$ is simply
$0.7225\,na^2$. Table 1 gives values of $i$ and $\sigma_x^2$ in this simple
case for selected test lengths and values of $a$. Note that where
$1.7a = 1.0$ (i.e., $a = .588$, corresponding to an item-trait
correlation of .7071 in a standard normal population), four additional
items provide a unit gain in precision. The results provide an
indication of the amount of information about $\theta$ that is employed
in JML estimation of item parameters. It is apparent that as test
length increases, information (i.e., precision) increases at a constant
rate and the posterior variance decreases at a decreasing rate.
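
The quantities described for Table 1 can be reproduced from the information formula. The sketch below assumes the two-parameter logistic model with the 1.7 scaling constant and identical items located at the examinee's proficiency; the function names and the test lengths in the loop are illustrative and are not taken from the paper's table.

```python
import math

D = 1.7  # scaling constant of the logistic model used in the text


def p_2pl(theta, a, b):
    """Probability of a correct response under the 2-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b):
    """Information about theta from one item: (1.7a)^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return (D * a) ** 2 * p * (1.0 - p)


def total_information(theta, items):
    """Sum of item information over a list of (a, b) pairs."""
    return sum(item_information(theta, a, b) for a, b in items)


if __name__ == "__main__":
    a = 1.0 / D  # so that 1.7a = 1
    for n in (4, 8, 16):  # illustrative test lengths
        info = total_information(0.0, [(a, 0.0)] * n)
        print(n, info, 1.0 / info)  # information and approximate posterior variance
```

With `1.7a = 1` each item contributes one quarter of a unit of information, so information rises linearly in `n` while its reciprocal falls ever more slowly.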


Insert  Table  1  about  here 


The magnitude of gain in information about $\theta$ obtained by
assuming an undifferentiated population (i.e., $F_i = F$) can be
gauged by extending the approximation employed for Table 1. If the
normalized likelihood function induced by $x_i$ is again approximated
as $N(\hat\theta, \sigma_x^2)$ and if it is further assumed that examinee $i$
has been selected at random from a population in which
$\theta \sim N(\mu, \sigma^2)$, then $p(\theta \mid x_i) \approx N(\bar\theta, \Sigma)$ with

$$\bar\theta = \frac{\hat\theta\,\sigma_x^{-2} + \mu\,\sigma^{-2}}{\sigma_x^{-2} + \sigma^{-2}}$$

and

$$\Sigma = (\sigma_x^{-2} + \sigma^{-2})^{-1}.$$

Table 2 shows values of the reciprocal of $\Sigma$ (i.e., "precision")
from various test lengths with identical items with $1.7a = 1$ and a
standard normal prior for $\theta$. Note that for each test length, a
unit gain in precision is achieved over the $1.7a = 1$ column of
Table 1. These tabled values fall within the ranges encountered in
applied work, and suggest that the assumed distribution contributes
about as much information about $\theta$ as four additional items. The
corresponding value for $1.7a = .5$ is sixteen items, and that for
$1.7a = 1.5$ is about one item. Since the absolute contribution is
constant with respect to increasing test length, the relative
contribution declines.
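
The combination of a normal approximation to the likelihood with a normal population distribution is a precision-weighted average. A minimal sketch follows, assuming both pieces are exactly normal; `combine_normal` and the numerical inputs are illustrative names and values, not the paper's.

```python
def combine_normal(theta_hat, var_x, mu, var_pop):
    """Combine a N(theta_hat, var_x) normalized likelihood with a
    N(mu, var_pop) population distribution.

    Returns the posterior mean and posterior variance; precisions add.
    """
    precision = 1.0 / var_x + 1.0 / var_pop
    mean = (theta_hat / var_x + mu / var_pop) / precision
    return mean, 1.0 / precision


if __name__ == "__main__":
    # Eight items with 1.7a = 1 carry information 8/4 = 2, so var_x = 0.5.
    mean, var = combine_normal(theta_hat=2.0, var_x=0.5, mu=0.0, var_pop=1.0)
    print(mean, var, 1.0 / var)
```

The standard normal population adds exactly one unit of precision regardless of test length, which is why its relative contribution shrinks as the test grows.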

To gauge the additional impact of differentiating the
population through auxiliary variables, we may consider numerical
values resulting from a regression model with homoscedastic
residuals. Suppose $y$ values account for $(100 \times r)$-percent of the
variance in a population with total variance 1.0, so that
$F_y(\theta) \sim N(\mu_y, \sigma_e^2)$ with $\sigma_e^2 = 1 - r$. If the
normalized likelihood induced by item responses is approximately
$N(\hat\theta, \sigma_x^2)$, then

$$p(\theta \mid x_i, y_i) \approx N\!\left[\frac{\hat\theta\,\sigma_x^{-2} + \mu_y\,\sigma_e^{-2}}{\sigma_x^{-2} + \sigma_e^{-2}},\; (\sigma_x^{-2} + \sigma_e^{-2})^{-1}\right].$$
Using the same simplified item response model and $a$ value as
Table 2, Table 3 compares values of the inverse of the posterior
variance for $\theta$ as determined by (i) item responses alone,
(ii) item responses with knowledge of membership in an
undifferentiated population with unit variance, and (iii) item
responses with the additional knowledge of auxiliary variables that
account for successively greater proportions of total variance. Values
between 10 and 40 percent, a range typical of educational and
psychological work, increase information (posterior precision) about
$\theta$ by amounts roughly equivalent to one to three additional item
responses. For items with $1.7a = .5$, gains in item units would be
doubled; for items with $1.7a = 1.5$, gains in item units would be
halved.
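
The conversion from proportion of variance explained to item-response units can be sketched as follows. This assumes the simplified setting of the text only: total population variance 1.0, residual variance $1 - r$, and items with $1.7a = 1$, each worth one quarter of a unit of precision. The function names are illustrative.

```python
ITEMS_PER_UNIT_PRECISION = 4.0  # holds for items with 1.7a = 1 (assumption from the text)


def precision_from_population(r):
    """Precision contributed by a conditional distribution with variance 1 - r."""
    return 1.0 / (1.0 - r)


def aux_gain_in_items(r):
    """Gain over the undifferentiated unit-variance population, in item units."""
    extra_precision = precision_from_population(r) - 1.0
    return extra_precision * ITEMS_PER_UNIT_PRECISION


if __name__ == "__main__":
    for r in (0.1, 0.2, 0.3, 0.4):
        print(r, aux_gain_in_items(r))
```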


Insert  Tables  2-3  about  here 

The Ignorability of p(y)

This section demonstrates that under reasonable assumptions,
the population distribution of $y$ can be ignored for the purposes of
estimating item parameters $\beta$ and population parameters $\alpha$.
Suppose that the distribution of $y$ in a population of examinees is
governed by the density function $p(y \mid \gamma)$, which depends on
possibly unknown parameters $\gamma$ but not upon item parameters $\beta$
nor on the parameters $\alpha$ of the conditional distributions
$f(\theta \mid y, \alpha)$. The probability of observing the data matrix
$(X, Y)$ from a random sample of $N$ examinees is given by

$$\begin{aligned}
P(X, Y \mid \beta, \alpha, \gamma)
&= \prod_i \int p(x_i \mid \theta, y_i, \beta, \alpha, \gamma)\, p(\theta \mid y_i, \beta, \alpha, \gamma)\, p(y_i \mid \beta, \alpha, \gamma)\, d\theta \\
&= \prod_i \int p(x_i \mid \theta, \beta)\, p(\theta \mid y_i, \alpha)\, p(y_i \mid \gamma)\, d\theta \\
&= \Big\{\prod_i \int p(x_i \mid \theta, \beta)\, p(\theta \mid y_i, \alpha)\, d\theta\Big\} \times \Big\{\prod_i p(y_i \mid \gamma)\Big\} \\
&= P(X \mid Y, \beta, \alpha)\, P(Y \mid \gamma).
\end{aligned}$$

Likelihood inferences about $\alpha$ and $\beta$ are therefore independent
of inferences about $\gamma$, and the conditional MLE's of $\alpha$ and
$\beta$ given $Y$ are identical to MLE's obtained jointly with $\gamma$.

Models  and  Methods 

This  section  presents  two  IRT  models  that  differentiate 
examinees  by  means  of  auxiliary  variables,  and  suggests  computing 
approximations  based  on  Bock  and  Aitkin's  (1981)  marginal  maximum 
likelihood  (empirical  Bayes)  procedures. 

Mixtures  of  Finite  Distributions 


Mislevy (1984) describes a nonparametric approximation of a
continuous density function of a latent variable in terms of a
distribution with mass at a finite number of prespecified points.
The proficiency of each examinee, $\theta_i$, is then assumed to take
one of only $Q$ known values. The "latent trait" problem is thereby
replaced by an analogous "latent class" problem that is easier to
solve. A single population was addressed in that presentation,
and item parameters were assumed known. We now consider extensions
to the simultaneous estimation of item parameters, and to multiple
subpopulations indexed by an auxiliary variable $y$. This approach
provides considerable flexibility in the distributions $F_y(\theta)$.
It lends itself well to discrete auxiliary variables with
relatively few values.

It proves convenient to write such an auxiliary variable as
a vector of 0/1 indicators. Define $y_i = (y_{i1}, \ldots, y_{iK})$, letting
$y_{ik} = 1$ if examinee $i$ is associated with the $k$'th of $K$ exhaustive
and mutually exclusive subpopulations, and zero otherwise. The
probability of observing response pattern $x_i$ from an examinee
selected at random from a specified subpopulation is given by

$$p(x_i \mid y_i, \beta) = \prod_k \Big\{\int p(x_i \mid \theta, \beta)\, dF_k(\theta)\Big\}^{y_{ik}},$$

where $F_k$ is the distribution in subpopulation $k$. This probability
can be approximated by a finite distribution as

$$p(x_i \mid y_i, \beta) \approx \prod_k \Big\{\sum_q p(x_i \mid \Theta_q, \beta)\, W_{qk}\Big\}^{y_{ik}}, \tag{8}$$

where $\Theta_1, \ldots, \Theta_Q$ is a grid of points and $W_{qk}$ is the weight
or density at point $q$ in subpopulation $k$. The weights $W$ play the
role of $\alpha$ in earlier notation. For the remainder of this
subsection, we limit our attention to distributions of the form
of the right-hand side of Equation 8. As demonstrated above, we
may carry out the estimation of $\beta$ and $W$ conditional on $Y$.

Let $(X, Y)$ be the data matrix observed from a sample of $N$
examinees selected either randomly from the population as a whole
or as random subsamples stratified on $y$. The probability of $X$
given $Y$ is proportional to

$$L_M = \prod_i \prod_k \Big\{\sum_q p(x_i \mid \Theta_q, \beta)\, W_{qk}\Big\}^{y_{ik}},$$

and its logarithm is

$$\ell_M = \log L_M = \sum_i \sum_k y_{ik} \log \sum_q p(x_i \mid \Theta_q, \beta)\, W_{qk}.$$

Relative maxima with respect to $\beta$ and $W$ can be obtained by means
of the EM algorithm, under the special case of missing indicators
for a multinomial distribution (Dempster et al., 1977, Section
4.3). The expectation step of cycle $t + 1$ computes expected
values of the following quantities:


The expected number of examinees with proficiency $\Theta_q$ from a
sample of size $N_k$ from subpopulation $k$, conditional on $X$, $Y$,
$\hat\beta^t$, and $\hat W^t$:

$$\hat N_{qk}^{t+1} = \sum_i y_{ik}\, p_k(\Theta_q \mid x_i),$$

where

$$p_k(\Theta_q \mid x_i) = p(x_i \mid \Theta_q, \beta = \hat\beta^t)\, \hat W_{qk}^t \Big/ \sum_r p(x_i \mid \Theta_r, \beta = \hat\beta^t)\, \hat W_{rk}^t,$$

an application of Bayes theorem, gives the posterior
probability that the proficiency of examinee $i$ is $\Theta_q$, given
provisional parameter estimates $\hat\beta^t$ and $\hat W^t$.

The expected number of correct responses to item $j$ from
examinees in subpopulation $k$ with proficiency $\Theta_q$, given a
random sample of size $N_k$ (again given $\hat\beta^t$ and $\hat W^t$):

$$\hat R_{jqk}^{t+1} = \sum_i y_{ik}\, x_{ij}\, p_k(\Theta_q \mid x_i).$$
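
The maximization step computes what would be MLE's of $\beta$ and
$W$ if $\hat N$ and $\hat R$ were observed quantities rather than conditional
expectations. For $W$, we have simply

$$\hat W_{qk}^{t+1} = \hat N_{qk}^{t+1} / N_k.$$

For $\beta$, we solve conditional expectations of likelihood equations:

$$0 = \sum_q \frac{\hat R_{jq+}^{t+1} - \hat N_{q+}^{t+1} P_j(\Theta_q)}{P_j(\Theta_q)[1 - P_j(\Theta_q)]}\, \frac{\partial P_j(\Theta_q)}{\partial \beta_j}, \tag{9}$$

where $\hat R_{jq+}^{t+1} = \sum_k \hat R_{jqk}^{t+1}$ and $\hat N_{q+}^{t+1}$ is similarly
defined. Under the 2-parameter logistic model, for example, Equation 9
simplifies as follows:

$$0 = \sum_q (\Theta_q - b_j)\,[\hat R_{jq+}^{t+1} - \hat N_{q+}^{t+1} P_j(\Theta_q)]$$

$$0 = \sum_q [\hat R_{jq+}^{t+1} - \hat N_{q+}^{t+1} P_j(\Theta_q)]$$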
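
The E-step and the weight update of the M-step can be sketched in a few lines. This is a minimal illustration, not the paper's program: it assumes the 2-parameter logistic model, holds item parameters at their provisional values (the item-parameter update of Equation 9 is omitted), and assumes every subpopulation contains at least one examinee. All function and variable names are illustrative.

```python
import math

D = 1.7  # logistic scaling constant used in the text


def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def pattern_likelihood(x, theta, items):
    """p(x | theta, beta) under local independence; items = [(a, b), ...]."""
    like = 1.0
    for x_j, (a, b) in zip(x, items):
        p = p_correct(theta, a, b)
        like *= p if x_j == 1 else 1.0 - p
    return like


def em_cycle_weights(X, groups, grid, items, W):
    """One E-step followed by the M-step for the weights W.

    X: list of 0/1 response lists; groups: subpopulation index k for each
    examinee; grid: the Q grid points; W[k][q]: current weights.
    Returns expected counts N_hat[k][q], R_hat[j][k][q], and updated weights.
    """
    K, Q, n = len(W), len(grid), len(items)
    N_hat = [[0.0] * Q for _ in range(K)]
    R_hat = [[[0.0] * Q for _ in range(K)] for _ in range(n)]
    for x, k in zip(X, groups):
        joint = [pattern_likelihood(x, grid[q], items) * W[k][q] for q in range(Q)]
        total = sum(joint)
        for q in range(Q):
            post = joint[q] / total  # posterior probability of grid point q
            N_hat[k][q] += post
            for j in range(n):
                R_hat[j][k][q] += x[j] * post
    W_new = []
    for k in range(K):
        N_k = sum(N_hat[k])  # equals the number of examinees in subpopulation k
        W_new.append([N_hat[k][q] / N_k for q in range(Q)])
    return N_hat, R_hat, W_new
```

Because each examinee's posterior probabilities sum to one, the expected counts for a subpopulation sum to its sample size, and the updated weights sum to one within each subpopulation.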


In principle, the linear indeterminacy in the 1-, 2-, and 3-
parameter logistic and normal IRT models presents no impediment to
the EM algorithm, which readily converges to one of the infinitely
many solutions on a ridge. Numerical stability and the quality of
the finite characterization of $F$ are enhanced, however, by
controlling the scaling of the solution at this point. One
convenient way of doing so is to standardize the weighted average
distribution. We have referred to the points $\Theta_q$ as specified a
priori; given the linear indeterminacy, we may conceive of only
their relative spacing as prespecified. After each EM cycle, then,
we may rescale the points as follows:

$$\Theta_q^{*} = (\Theta_q - \bar\Theta)/s,$$

where

$$\bar\Theta = N^{-1} \sum_k N_k \sum_q \Theta_q\, \hat W_{qk}^{t+1}$$

and

$$s^2 = N^{-1} \sum_k N_k \sum_q (\Theta_q - \bar\Theta)^2\, \hat W_{qk}^{t+1}.$$


Item parameters are adjusted accordingly. Under the 2- and 3-

where $p_k(\Theta_q \mid x_i)$ is evaluated at $\hat\beta$ and $\hat W$.
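
The rescaling of the grid after an EM cycle can be sketched as below. It assumes the weights within each subpopulation sum to one; the companion adjustment for 2-parameter logistic item parameters is one way to keep $a_j(\theta - b_j)$ unchanged under the rescaling and is offered as an illustration, not as the paper's stated formula. Names are illustrative.

```python
import math


def rescale_grid(grid, W, N_k):
    """Standardize the sample-size-weighted average distribution.

    grid: Q points; W[k][q]: weights per subpopulation; N_k: sample sizes.
    Returns the rescaled points together with the mean and s that were removed.
    """
    N = sum(N_k)
    mean = sum(n_k * sum(t * w for t, w in zip(grid, W_k))
               for n_k, W_k in zip(N_k, W)) / N
    var = sum(n_k * sum((t - mean) ** 2 * w for t, w in zip(grid, W_k))
              for n_k, W_k in zip(N_k, W)) / N
    s = math.sqrt(var)
    return [(t - mean) / s for t in grid], mean, s


def rescale_2pl_item(a, b, mean, s):
    """Adjust (a, b) so that a * (theta - b) is unchanged after rescaling theta."""
    return a * s, (b - mean) / s
```

After rescaling, the pooled distribution has mean zero and unit variance, and response probabilities computed from the adjusted item parameters at the new points equal those computed before.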

A  Linear  Model 

The unrestricted mixture solution described above becomes
unwieldy as the number of potential values of the auxiliary
variable increases. The more structured alternative of a linear
model for $p(\theta \mid y)$ is suitable when $y$ is vector-valued or is
continuous rather than discrete. Assuming homoscedastic and normal
residuals, we would have

$$\theta \mid y_i \sim N(y_i'\alpha, \sigma^2),$$

where auxiliary variables are coded so that the $K$ columns of
$Y = (y_1, \ldots, y_N)'$, which are basis vectors for the $K$ elements of
$\alpha$, are linearly independent. They may include values on measured
variables such as previous test scores and dummy regression
variables that encode selected contrasts among categorical
auxiliary variables.

Maximum likelihood solutions for $\alpha$ and $\sigma^2$ in the special
case of structured means for the cells of a multi-way design have been
given by Mislevy (1985) under the assumption that item parameters
are known, and by Zwarts and Veldhuesen (1985) under the assumption
that $p(x \mid \theta)$ is the Rasch model with unknown item parameters to
be estimated jointly. These solutions are readily extended to the
case of a general IRT model with unknown item parameters. This
section describes an approximation over a grid of prespecified
points so that computation is similar to the nonparametric solution
described above. Attention is focused for convenience upon the 1-,
2-, and 3-parameter logistic and normal IRT models.

The linear indeterminacies of these models are again
conveniently resolved by restrictions on the population parameters.
First, we may without loss of generality fix $\sigma^2$ at unity to set the
unit-size of the scale. For 1-parameter models, a slope parameter
common over items is then estimated. Second, we may set the
origin by centering the elements of each column of $Y$ at zero. All
effects are thus cast as deviations around a grand mean of zero.
This restriction, in conjunction with the independence of the basis
vectors, completes the resolution of the scale.

The marginal likelihood for a sample of size $N$ is written as

$$L = \prod_i \int p(x_i \mid \theta, \beta)\, \phi(\theta - y_i'\alpha)\, d\theta,$$

where $\phi$ represents the standard normal density function.
Approximation over a finite grid of points is accomplished by

$$L^{*} = \prod_i \sum_q p(x_i \mid \Theta_q, \beta)\, W_{qi}(\alpha),$$

where

$$W_{qi}(\alpha) = \exp[-(\Theta_q - y_i'\alpha)^2/2] \Big/ \sum_r \exp[-(\Theta_r - y_i'\alpha)^2/2].$$

The weights $W$ play the same role as those in the preceding
approximation. The difference is that they are no longer estimated
without restriction, but modeled as functions of the effect
parameters $\alpha$.
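
The structured weights are simply a unit-variance normal kernel centered at $y_i'\alpha$ and normalized over the grid. A minimal sketch, with illustrative names and inputs:

```python
import math


def grid_weights(y_i, alpha, grid):
    """Weights W_qi(alpha) for one examinee.

    y_i: the examinee's auxiliary-variable vector; alpha: effect parameters;
    grid: the prespecified points. The residual variance is fixed at one.
    """
    mu = sum(y * a for y, a in zip(y_i, alpha))  # conditional mean y_i' alpha
    kernel = [math.exp(-(t - mu) ** 2 / 2.0) for t in grid]
    total = sum(kernel)
    return [k / total for k in kernel]


if __name__ == "__main__":
    grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
    print(grid_weights([0.5, -0.5], [1.0, 0.4], grid))
```

Only the $K$ elements of $\alpha$ are free here, in contrast to the $Q - 1$ free weights per subpopulation in the unrestricted mixture.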

MML estimation can again proceed in EM cycles that solve the
likelihood equations. Let $\hat\beta^t$ and $\hat\alpha^t$ be provisional
estimates from cycle $t$. The E-step computes expected counts of
examinees and correct responses at each point:

$$\hat N_q^{t+1} = \sum_i p(\Theta_q \mid x_i, \hat\beta^t, \hat\alpha^t)$$

and

$$\hat R_{jq}^{t+1} = \sum_i x_{ij}\, p(\Theta_q \mid x_i, \hat\beta^t, \hat\alpha^t),$$

where

$$p(\Theta_q \mid x_i, \hat\beta^t, \hat\alpha^t) = p(x_i \mid \Theta_q, \hat\beta^t)\, W_{qi}(\hat\alpha^t) \Big/ \sum_r p(x_i \mid \Theta_r, \hat\beta^t)\, W_{ri}(\hat\alpha^t).$$

It also computes the conditional expected value of each examinee's
proficiency:

$$\bar\theta_i^{t+1} = \sum_q \Theta_q\, p(\Theta_q \mid x_i, \hat\beta^t, \hat\alpha^t).$$

The M-step pseudo-likelihood equations for item parameters can
be written as in Equation 9. The equations for $\alpha$ simplify to

$$\hat\alpha^{t+1} = (Y'Y)^{-1}\, Y'\, \bar\theta^{t+1},$$

where $\bar\theta^{t+1} = (\bar\theta_1^{t+1}, \ldots, \bar\theta_N^{t+1})$. The
posterior information matrix for $\beta$ can again be approximated via
Equation 10.
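
The update for the effect parameters is an ordinary least-squares regression of the posterior mean proficiencies on the auxiliary variables. The sketch below solves the normal equations with a small Gaussian elimination routine so that it needs only the standard library; it assumes the columns of $Y$ are linearly independent, and the helper names are illustrative.

```python
def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z


def update_alpha(Y, theta_bar):
    """Return (Y'Y)^{-1} Y' theta_bar for an N-by-K design matrix Y."""
    K = len(Y[0])
    YtY = [[sum(row[a] * row[b] for row in Y) for b in range(K)] for a in range(K)]
    Ytt = [sum(row[a] * t for row, t in zip(Y, theta_bar)) for a in range(K)]
    return solve_linear(YtY, Ytt)
```

For a 2-by-2 design with centered contrast codes, `update_alpha` recovers the two main effects from the examinees' posterior means.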

A  Numerical  Example 

This  section  illustrates  the  procedures  described  above.  The 
data  are  responses  to  four  items  from  the  Arithmetic  Reasoning  test 
of the Armed Services Vocational Aptitude Battery (ASVAB), Form 8A,
as  observed  in  a  sample  of  776  participants  in  the  Profile  of  American 
Youth  survey  (U.S.  Department  of  Defense,  1982).  Table  4  gives 
counts  of  the  sixteen  possible  response  patterns  occurring  in  each 
cell  of  a  2-by-2  design  based  on  two  background  variables  collected 
along  with  item  responses.  Because  these  variables  are  based  on 
demographic  information  rather  than  the  educationally-relevant 
information  we  would  prefer,  we  will  refer  to  the  factors  as  simply 
Factor  A  and  Factor  B,  nesting  levels  1  and  2  within  each. 


Insert  Table  4  about  here 


Four  analyses  were  carried  out  on  these  data.  In  each,  the 
2-parameter  logistic  ogive  was  employed  as  the  IRT  model  for 
conditional  probabilities  of  correct  response.  The  analyses 
differed  in  terms  of  the  auxiliary  information  about  examinees  they 
employed.  The  first  run  used  MML  estimation  of  item  parameters  and 
densities  over  a  grid  of  ten  points,  assuming  examinees  were  drawn 
at  random  from  a  single  undifferentiated  population.  The  second 
and  third  runs  differentiated  the  population  via  Factor  A  and 
Factor  B  respectively,  and  the  fourth  run  employed  both  factors 
jointly.

Resulting  item  parameter  estimates  and  standard  errors,  along 
with  subpopulation  means  and  standard  deviations,  are  shown  in 
Tables  5  through  8.  The  scale  has  been  set  in  all  solutions  to 
standardize  the  total  population.  For  each  item  parameter  type, 
columns in Tables 6 through 8 display the ratio of the squared
standard  error  of  the  item  parameter  estimate  under  the 
undifferentiated  model  to  the  corresponding  value  in  the 
differentiated  model.  The  result  can  be  interpreted  as  efficiency 
relative  to  the  undifferentiated  model,  and  the  excess  of  a  value 
above  unity  reflects  the  proportional  increase  in  estimation 
precision.  Geometric  averages  are  also  shown  for  the  relative 
efficiency columns. The excess of such a value over unity, times
four, gives the increase in precision in units of additional
items of the same kind.
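
The relative-efficiency summary can be computed as sketched below. The function names are illustrative, and the standard errors in the demonstration are made-up placeholders rather than values from Tables 5 through 8.

```python
import math


def relative_efficiency(se_undiff, se_diff):
    """Ratio of squared standard errors: undifferentiated over differentiated."""
    return (se_undiff / se_diff) ** 2


def geometric_mean(values):
    """Geometric average of positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def gain_in_items(efficiencies, n_items=4):
    """Excess of the geometric-mean efficiency over unity, times the test length."""
    return (geometric_mean(efficiencies) - 1.0) * n_items


if __name__ == "__main__":
    # Placeholder standard errors for four items (hypothetical, for illustration).
    se_undiff = [0.110, 0.095, 0.120, 0.105]
    se_diff = [0.100, 0.090, 0.110, 0.100]
    effs = [relative_efficiency(u, d) for u, d in zip(se_undiff, se_diff)]
    print(effs, gain_in_items(effs))
```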

Insert  Tables  5-8  about  here 

It  is  apparent  that  including  auxiliary  information  had  little 
effect  on  the  values  of  the  item  parameter  estimates.  The 
differences  between  the  estimates  from  the  undifferentiated  and  the 
fully  differentiated  solutions  occur  only  in  the  second  decimal 
place.  More  significant  differences  exist  in  the  accompanying 
(estimated)  standard  errors,  however.  The  precision  of  threshold 
estimates  was  improved  only  modestly;  an  increase  roughly 
equivalent  to  one  additional  item  response  per  examinee  was 
observed  in  the  fully  differentiated  run.  The  precision  of  slope 
estimates  was  improved  dramatically;  an  increase  roughly  equivalent 
to  eight  items  was  observed.  It  would  appear  that  Factor  A 
accounted  for  more  increase  in  precision  for  slopes,  while  Factor  B 
accounted  for  more  increase  in  precision  for  thresholds. 

Discussion 

This  paper  has  outlined  procedures  for  incorporating 
auxiliary  information  about  examinees  into  the  IRT  framework. 
Enhancing  the  precision  of  item  parameter  estimates  was  the  primary 
focus.  This  section  evaluates  the  value  of  improvements  so 
attained,  and  discusses  two  additional  aspects  of  the  model. 

The increase in information about item parameters in typical
educational and psychological settings can be expected to be
equivalent to two to six additional item responses per examinee.
The numerical example suggests that the increase will vary by item
parameter type, probably less for well-estimated parameters and
greater for poorly-estimated parameters.
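The entries of Table 3 are consistent with a simple rule, sketched below as an assumption inferred from the tabled values rather than a formula quoted from the text: in a standardized population, auxiliary information with squared multiple correlation R² leaves a residual variance of 1 - R², so the precision contributed by the population becomes 1/(1 - R²), and dividing by the one-item increment of .250 expresses it in item units. All names are hypothetical.

```python
PRECISION_PER_ITEM = 0.250    # Table 3: increment from one item response
POPULATION_PRECISION = 1.000  # Table 3: increment from population membership


def precision_with_auxiliary(r_squared):
    """Posterior precision increment, assuming residual variance 1 - R^2."""
    return POPULATION_PRECISION / (1.0 - r_squared)


def table3_row(r_squared):
    """Return (increment, item units, percent gain, extra items)."""
    increment = precision_with_auxiliary(r_squared)
    item_units = increment / PRECISION_PER_ITEM
    percent_gain = 100.0 * (increment / POPULATION_PRECISION - 1.0)
    # Items gained beyond the four already implied by population membership.
    extra_items = item_units - POPULATION_PRECISION / PRECISION_PER_ITEM
    return increment, item_units, percent_gain, extra_items


# Keys are R^2 in tenths (1 means R^2 = .10, and so on) to avoid float keys.
rows = {tenths: table3_row(tenths / 10.0) for tenths in range(1, 10)}

for tenths, (inc, units, pct, extra) in rows.items():
    print(f"R2 = .{tenths}0  {inc:6.3f}  {units:7.3f}  {pct:6.1f}%  {extra:6.3f}")
```

Under this reading, an R² of .50 is worth four items beyond population membership alone and an R² of .60 is worth six, so the range of two to six items corresponds to R² between roughly one third and .60. The item-unit column in Table 3 appears to have been computed from the rounded increments, so a few entries differ from this sketch in the third decimal place.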

The expected increase is modest, to be sure, but in many
applications it is free in the sense that the auxiliary information
is already available for use. Because its incremental value
decreases for longer tests, auxiliary information would be most
useful in settings where relatively few responses are solicited from
each examinee. This
would  include  two  applications  of  great  current  interest,  namely 
educational  assessment  and  adaptive  testing.  In  assessment,  data 
that  are  sparse  at  the  level  of  individuals — say,  five  items  in  a 
given  scale — yield  more  efficient  estimates  of  population 
parameters  for  a  given  total  number  of  item  responses.  In 
adaptive  testing,  new  items  are  calibrated  using  joint  response 
patterns  with  previously-calibrated  items  while  the  number  of  old 
items  is  held  to  minimally  acceptable  levels — as  few  as,  say, 
fifteen. 

A side issue in the present paper but a fundamentally
important result is that when examinees are indeed a random sample
from a well-defined population, the estimated population
distributions and effect parameters are consistent within the limits
of precision afforded by the numerical approximations (see Mislevy,
1984, 1985, on population estimation when item parameters are
known). This stands in contrast to the asymptotically biased
results obtained by using the distribution of the estimates θ̂ to
approximate the distribution of θ. In fact, the discrepancy between
the two distributions is largest in exactly those cases in which the
present procedures offer the most benefit for item parameter
estimation, namely short tests.
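A rough sense of why the discrepancy is largest for short tests can be had from the following sketch. It rests on simplifying assumptions that are not part of the paper's procedures: each item contributes a constant precision of .250 (the one-item value in Table 3), estimation error is independent of true proficiency, and the population is standardized. The names are hypothetical.

```python
# Assumed simplification: constant information per item and estimation
# error independent of true proficiency, in a standardized population.
PRECISION_PER_ITEM = 0.25
TRUE_VARIANCE = 1.0


def variance_of_estimates(n_items):
    """Variance of the proficiency estimates: true variance plus error variance."""
    error_variance = 1.0 / (PRECISION_PER_ITEM * n_items)
    return TRUE_VARIANCE + error_variance


def variance_inflation(n_items):
    """Ratio of the variance of the estimates to the true variance."""
    return variance_of_estimates(n_items) / TRUE_VARIANCE


test_lengths = [5, 15, 60]
inflation = {n: variance_inflation(n) for n in test_lengths}

for n in test_lengths:
    print(f"{n:3d} items: variance of estimates is {inflation[n]:.3f} times the true variance")
```

The inflation depends only on test length, not on the number of examinees, which is why a larger sample does not remove the bias.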

Finally,  it  is  implicit  in  preceding  discussions  that  auxiliary 
information  about  examinees  can  lead  to  improved  estimates  of 
individual  proficiencies.  Whether  estimates  that  are  improved  in 
the sense of minimum mean squared error are unequivocally "better"
for  all  applications  is  not  clear,  however.  We  have  avoided 
advocating  the  use  of  auxiliary  information  when  tests  are  used  as 
contests — i.e.,  when  important  placement  or  selection  decisions  are 
made  for  individual  examinees — because  it  would  seem  that  in  these 
situations  the  tester  ought  to  gather  enough  data  directly  dependent 
upon proficiency (i.e., item responses) to make satisfactorily
precise  decisions  on  that  strength  alone.  In  adaptive  testing,  for 
example,  we  would  recommend  the  use  of  auxiliary  information  to 
improve  item  parameter  estimation,  but  not  to  estimate  scores  that 
will  be  used  to  compare  individual  examinees. 

Table 3

Precision Increases for θ Resulting from the Use of
Auxiliary Information

                          Increment       Precision      Gain over
                          in Posterior    Gain in        Undifferentiated
Source                    Precision       Item Units     Population

One-item response            .250           1.000            --
Population membership       1.000           4.000            --
Auxiliary information
  R² = .10                  1.111           4.444           11.1%
  R² = .20                  1.250           5.000           25.0%
  R² = .30                  1.429           5.716           42.9%
  R² = .40                  1.667           6.668           66.7%
  R² = .50                  2.000           8.000          100.0%
  R² = .60                  2.500          10.000          150.0%
  R² = .70                  3.333          13.332          233.3%
  R² = .80                  5.000          20.000          400.0%
  R² = .90                 10.000          40.000          900.0%

Table 5

Item Parameter Estimates: Undifferentiated Population

Item       b      SE(b)       a      SE(a)

  1      -.422    .058      1.022    .171
  2      -.226    .072       .666    .094
  3       .152    .076       .705    .096
  4       .397    .080       .839    .114

Population Mean: 0.000

Population Standard Deviation: 1.000

Table 6

Item Parameter Estimates: Population Differentiated
with Respect to Factor A Only

         Relative                       Relative
SE(b)    Efficiency      a     SE(a)    Efficiency

Geometric average
relative efficiency: 1.035

Subpopulation means:

Subpopulation standard deviations:

Table 7

Item Parameter Estimates: Population Differentiated
with Respect to Factor B Only

                         Relative                        Relative
Item      b     SE(b)    Efficiency      a     SE(a)     Efficiency

  1      .408   .057       1.035        .941   .073        5.487
  2      .211   .077        .874        .621   .056        2.818
  3      .185   .071       1.146        .686   .058        2.740
  4      .431   .064       1.563        .842   .067        2.895

Geometric average
relative efficiency:       1.128                           3.328

Subpopulation means: .136, -.147

Subpopulation standard deviations: 1.021, .955

Table 8

Item Parameter Estimates: Population Differentiated
with Respect to Factors A and B

                         Relative                        Relative
Item      b     SE(b)    Efficiency      a     SE(a)     Efficiency

  1     -.421   .052       1.244       1.006   .080        4.569
  2     -.213   .071       1.028        .672   .059        2.538
  3      .139   .065       1.367        .775   .063        2.311
  4      .402   .066       1.469        .834   .066        2.983

Geometric average
relative efficiency:       1.266                           2.994

Subpopulation means: .485, .073, -.513, -.502

Subpopulation standard deviations: 1.164, .855, .642, .640